Authors: Ben Grunwald, Julian Nyarko, John Rappaport

This codebook describes the data and variables used in

   Ben Grunwald, Julian Nyarko & John Rappaport, "Police agencies on Facebook overreport on Black suspects"

# General Remarks:
Please note the following: The first code file, 01_data_prep.R, creates two large files, dta_single.rdata and combined_dta_long_483.rdata, on which many of the analyses rely. However, combined_dta_long_483.rdata is very large (30GB), and the code that uses it requires a lot of RAM. For user convenience, we therefore provide an alternative: you may run 01_data_prep_alt.R, which creates only the smaller file dta_single.rdata. In addition, we provide all output files that rely on combined_dta_long_483.rdata, so that the code producing them does not need to be run.

To create Figure S5 (NIBRS), run the code as usual, but use nibrs_data.csv instead of arrest_data.csv.
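
The substitution works because, according to the variable lists in this codebook, arrest_data.csv and nibrs_data.csv share the same five columns. The following is a minimal Python sketch (not part of the replication code, which is written in R) that checks an input file for those columns before use. The function name `read_records` and the sample row are hypothetical and for illustration only.

```python
import csv
import io

# Columns this codebook lists for both arrest_data.csv and nibrs_data.csv.
EXPECTED_COLUMNS = ["ori9", "year", "crime", "crime_class", "black"]


def read_records(handle):
    """Read rows from an open CSV handle and check that the shared columns exist."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up row for illustration; the identifier and values are not real data.
# In practice, pass open("arrest_data.csv", newline="") or
# open("nibrs_data.csv", newline="") instead of this in-memory sample.
sample = io.StringIO(
    "ori9,year,crime,crime_class,black\n"
    "XX0000000,2015,robbery,violent,1\n"
)
rows = read_records(sample)
```

Either file can be passed to the same function, which mirrors how the analysis code can be pointed at one file or the other.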

### post_data.csv ###
The file post_data.csv is arranged at the post level. Given our agreement with CrowdTangle, the source of our Facebook data, we are unable to publish the raw text of each post. In addition, we are not able to publish information on the number of interactions (variable stats_interactions), the number of shares (variable stats_shareCount) or the subscriber count (variable subscriberCount). In the published data, the relevant columns are replaced with "NA". Interested users may pull the text of the original posts and the missing data directly from CrowdTangle using the provided post identifiers. Because these data are censored, 05_fig_s1.R is unable to generate the output file 'all_results_alternative_weights.csv'. A pregenerated version of this file is provided in created_by_large_file.

The variables are defined as follows:
ori9: agency identifier (ORI9)
id: post identifier
year: year of publication
crime: crime type described in the post
crime_class: crime class described in the post (violent or property)
black: whether the post describes a Black suspect
subscriberCount: the subscriber count at the time the post was made (this variable is replaced with missing values due to licensing restrictions)
stats_interactions: total number of interactions with the post (this variable is replaced with missing values due to licensing restrictions)
stats_shareCount: total number of shares of the post (this variable is replaced with missing values due to licensing restrictions)
arrest_post: whether the post describes an arrest
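
As a quick sanity check, a reader may want to confirm that the three censored columns contain only "NA" in the published file. The sketch below uses Python's standard library and is not part of the replication code. The column order in the sample header, the sample values, and the function name `censored_columns_are_na` are assumptions for illustration.

```python
import csv
import io

# The three columns this codebook says are replaced with "NA".
CENSORED = ["subscriberCount", "stats_interactions", "stats_shareCount"]


def censored_columns_are_na(handle):
    """Return True if every censored column holds the string "NA" in every row."""
    reader = csv.DictReader(handle)
    return all(row[col] == "NA" for row in reader for col in CENSORED)


# Made-up row; the real file's column order and value coding may differ.
sample = io.StringIO(
    "ori9,id,year,crime,crime_class,black,"
    "subscriberCount,stats_interactions,stats_shareCount,arrest_post\n"
    "XX0000000,1,2015,robbery,violent,1,NA,NA,NA,0\n"
)
all_na = censored_columns_are_na(sample)
```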

### arrest_data.csv ###
The file arrest_data.csv is arranged at the arrest level. The variables are defined as follows:
ori9: agency identifier (ORI9)
year: year of arrest
crime: crime type associated with the arrest
crime_class: crime class associated with the arrest (violent or property)
black: whether the arrest involved a Black suspect
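
To illustrate how the arrest-level layout can be aggregated, the sketch below computes the unweighted share of arrests involving a Black suspect for each agency. It assumes that `black` is coded as 1/0, which this codebook does not state, and it is not claimed to reproduce the black_arrest values in the paper's output files. The function name and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def black_share_by_agency(handle):
    """Return {ori9: share of rows with black == 1}, assuming 1/0 coding."""
    totals = defaultdict(int)
    black = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row["ori9"]] += 1
        black[row["ori9"]] += int(row["black"])  # assumption: 1 = yes, 0 = no
    return {ori9: black[ori9] / totals[ori9] for ori9 in totals}


# Made-up rows for two hypothetical agencies.
sample = io.StringIO(
    "ori9,year,crime,crime_class,black\n"
    "XX0000000,2015,robbery,violent,1\n"
    "XX0000000,2015,robbery,violent,0\n"
    "XX0000000,2016,burglary,property,0\n"
    "XX0000000,2016,burglary,property,0\n"
    "YY0000000,2015,robbery,violent,1\n"
    "YY0000000,2016,burglary,property,1\n"
)
shares = black_share_by_agency(sample)
```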

### location_data.csv ###
The file location_data.csv is arranged at the ori9 level. The variables are defined as follows:
lng: longitudinal coordinate
lat: latitudinal coordinate
ori9: agency identifier (ORI9)

### nibrs_data.csv ###
The file nibrs_data.csv is arranged at the incident level. The variables are defined as follows:
ori9: agency identifier (ORI9)
year: year of the incident
crime: crime type associated with the incident
crime_class: crime class associated with the incident (violent or property)
black: whether the incident involved a Black suspect

### agency_data.csv ###
The file agency_data.csv is arranged at the ori9 level. The variables are defined as follows:
ori9: agency identifier (ORI9)
agency_name: name of the agency
lon: longitudinal coordinate
lat: latitudinal coordinate
fips: FIPS code
republican_vote: mean Republican vote share over the 2012 and 2016 elections
black_pop: Black population share according to the 2010 census
black_officers: share of Black officers

### ori9_fips.csv ###
The file ori9_fips.csv is arranged at the ori9 level. The variables are defined as follows:
ori9: agency identifier (ORI9)
fips: FIPS code

### ori9_population.csv ###
The file ori9_population.csv is arranged at the ori9 level. The variables are defined as follows:
ori9: agency identifier (ORI9)
pop_ave: average population size
pop_grp_over_10k: whether the population size is over 10k (as defined in the paper)
always_report: whether the agency reports data over all months in a year (as defined in the paper)
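
Because the agency-level files all share the ori9 identifier, they can be combined with a simple key join. The sketch below shows an inner join of agency_data.csv and ori9_population.csv in Python's standard library. It is an illustration rather than the authors' merging code, and all sample values and helper names are made up.

```python
import csv
import io


def index_by_ori9(handle):
    """Map each ori9 identifier to its row (assumes one row per agency)."""
    return {row["ori9"]: row for row in csv.DictReader(handle)}


def join_on_ori9(left, right):
    """Inner join: keep agencies present in both inputs and merge their columns."""
    return [{**left[key], **right[key]} for key in left if key in right]


# Made-up rows; identifiers, names, and values are not real data.
agency = io.StringIO(
    "ori9,agency_name,lon,lat,fips,republican_vote,black_pop,black_officers\n"
    "XX0000000,Example PD,-100.0,40.0,00000,0.5,0.1,0.1\n"
    "YY0000000,Other PD,-90.0,35.0,00001,0.4,0.2,0.1\n"
)
population = io.StringIO(
    "ori9,pop_ave,pop_grp_over_10k,always_report\n"
    "XX0000000,25000,1,1\n"
)
joined = join_on_ori9(index_by_ori9(agency), index_by_ori9(population))
```

An inner join drops agencies that are missing from either file; a different join type may be more appropriate depending on the analysis.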


##
The convenience files generated from combined_dta_long_483.rdata include:

### all_exposure_by_ori9.csv ###
Overexposure for all agencies, individualized by crime type/class. The variables are defined as follows:
agency: agency identifier (ORI9)
black_post: distance-weighted share of posts describing a Black suspect
black_arrest: share of arrests involving a Black suspect
overexposure: local overexposure
type: crime type / class


### overexposure_by_ori9.csv ###
Overexposure for agencies across all crime types. The variables are defined as follows:
ori9: agency identifier (ORI9)
agency_name: name of the agency
overexposure: local overexposure
lon: longitudinal coordinate
lat: latitudinal coordinate
republican_vote: mean Republican vote share over the 2012 and 2016 elections
black_pop: Black population share according to the 2010 census
black_officers: share of Black officers

### all_results_[...].csv ###
Different output files for regressions using combined_dta_long_483. The variables are defined as follows:
type: crime type
weights: weights used; always "d" for distance
agency: the posts are weighted at the agency level
estimate: the point estimate
cse: the clustered standard error
se: the unclustered standard error
N: the number of observations
r2: the R2
adjr2: the adjusted R2
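
As an example of how these columns might be used, the sketch below derives an approximate 95% confidence interval from `estimate` and `cse` with the usual normal approximation (estimate ± 1.96 × standard error). This is an assumption for illustration; the paper may construct its intervals differently. The function name and the sample row are hypothetical.

```python
import csv
import io

Z_95 = 1.96  # two-sided 95% critical value under a normal approximation


def add_confidence_interval(handle):
    """Return one dict per row with lower/upper bounds based on the clustered SE."""
    results = []
    for row in csv.DictReader(handle):
        estimate = float(row["estimate"])
        cse = float(row["cse"])
        results.append({
            "type": row["type"],
            "lower": estimate - Z_95 * cse,
            "upper": estimate + Z_95 * cse,
        })
    return results


# Made-up row; the values do not come from the paper.
sample = io.StringIO(
    "type,weights,estimate,cse,se,N,r2,adjr2\n"
    "violent,d,0.5,0.1,0.05,100,0.2,0.19\n"
)
intervals = add_confidence_interval(sample)
```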